ISYE6501 Notes

Week 1

Module 1 - Intro & What is Modeling

What is modeling?

  1. Describe a real-life situation mathematically
  2. Analyze the math
  3. Turn mathematical answer back into real-life situation

A "model" can mean 3 things:

  1. General (regression)
  2. General with variables (regression using size, weight, distance)
  3. Specific estimates (regression using bias + coefficients * variables)

Module 2 - Classification

Error & cost:

  • Minimizing errors: Classification lines try to divide the data points with the maximum distance to spare, so that small variations in the data don't lead to classification errors.
  • Also consider the relative weight of errors, between false positives and false negatives. The more costly an error is, the more distance we want to shift the line away from it. This reflects "risk tolerance" and is especially relevant in medical analytics.

Features:

  • If the classifier between two features is a straight horizontal or vertical line, the feature parallel to that line is not useful for classification.

Data Definitions

Tables

  • Every row is a "data point"
  • Each column of a data point contains a "feature" ("attribute", "covariate", etc.), i.e. the x values.
  • Response/outcome is the "answer" of each data point, ie. the y value.

Structured Data

  • data stored in a structured way, either quantitatively (age) or categorically (gender, hair color)

Unstructured Data

  • Data not easily described and stored.
  • Example: Text

Common types of structured data

  1. Quantitative - numbers with meaning
  2. Categorical
    • Numbers without meaning (ex. zipcodes)
    • Non-numeric: Hair color
  3. Binary data (subset of categorical data)
    • Can only take one of two values
    • Can sometimes be treated as a quantitative measure
  4. Unrelated data - no relationship between data points
  5. Time series data - same data recorded over time (stock, height of a child)
    • Often recorded at same interval (but not necessarily)

Support Vector Machines

Applied to a credit loan problem:

m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
  x_i1 = credit score of person i
  x_i2 = income of person i
y_i = response for data point i
  y_i = 1 if data point i is blue, -1 if data point i is red
classifier line: (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0

Notes:

  • Need to scale the data. If not, certain coefficients will be far more sensitive to change than others.
  • If a certain coefficient (feature) is near-zero, it is probably not relevant for classification.
  • Works the same in more dimensions (ie. more attributes)
  • Classifier doesn't have to be a straight line. Can use kernel methods to make non-linear classifiers.
  • Logistic regression is better for getting probability of classification (e.g. 37% likely to default).
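The classifier line above amounts to a simple decision rule. A minimal Python sketch (illustrative only — the coefficients are made up, standing in for values a trained SVM would produce):

```python
# Sketch of a linear SVM-style decision rule. The coefficients a and intercept
# a0 are hypothetical, standing in for values a trained SVM would find.
def classify(x, a, a0):
    """Return 1 ("blue") if the point lies on the positive side of the line, else -1 ("red")."""
    value = sum(ai * xi for ai, xi in zip(a, x)) + a0
    return 1 if value >= 0 else -1

# Two attributes (e.g. scaled credit score and income), made-up coefficients:
a, a0 = [0.8, 0.6], -1.0
print(classify([1.5, 0.5], a, a0))   # positive side of the line -> 1
print(classify([0.2, 0.3], a, a0))   # negative side of the line -> -1
```

The sign of the weighted sum relative to the line is all the classifier uses; the SVM training step only determines where the line sits.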

Scaling data

Scaling linearly:

$$ x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} $$

Scaling to a normal distribution:

$$ x_{standardized} = \frac{x - \mu}{\sigma} $$

Which method to use? Depends:

  • Use scaling for data in a bounded range.
    • neural networks
    • optimization models that need bounded data
    • batting average (0.000-1.000)
    • RGB color intensities (0-255)
    • SAT scores (200-800)
  • Use standardization for PCA / Clustering.
  • Sometimes it won't be clear. Try both!
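The two approaches can be sketched in a few lines of Python (illustrative only; in practice use your stats package's built-ins):

```python
def scale_minmax(xs):
    """Linear (min-max) scaling: maps values into the bounded range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: subtract the mean, divide by the (population) std dev."""
    mu = sum(xs) / len(xs)
    sd = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sd for x in xs]

data = [10.0, 20.0, 30.0, 40.0]
print(scale_minmax(data))   # bounded in [0, 1]
print(standardize(data))    # mean 0, std dev 1 (values can fall outside [-1, 1])
```

Note the difference in output ranges: min-max scaling is bounded, standardization is not.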

Handwritten notes for SVM

(images of handwritten notes)

KNN Algorithm

Idea: find the class of a new data point.

  1. Pick the k closest points ("nearest neighbors") to the new one.
  2. The new point's class is the most common class among those k neighbors.

Things to keep in mind:

  • Can be used for multi-class classification tasks.
  • "K" is a parameter that can be tweaked.
  • There is more than one way to calculate distance metrics.
  • Some attributes can be weighted by importance.
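The idea can be sketched as a minimal KNN classifier in Python (Euclidean distance, unweighted votes; the data points are made up):

```python
from collections import Counter

def knn_classify(point, data, labels, k=3):
    """Classify `point` by majority vote among its k nearest neighbors (Euclidean distance)."""
    nearest = sorted(range(len(data)),
                     key=lambda i: sum((a - b) ** 2 for a, b in zip(point, data[i])))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters:
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
print(knn_classify((2, 2), data, labels))  # red
print(knn_classify((7, 8), data, labels))  # blue
```

Swapping the distance function or weighting votes by distance covers the "more than one distance metric" and "weighted attributes" points above.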


Classification Summary

  • Divides data points into groups based on similarity
  • Graphical intuition
  • Basic solution methods
    • SVM
    • KNN
  • Data terminology
  • Validation
  • Distance metrics
  • Confusion matrices

Week 2

Module 3 - Validation

Data has 2 types of patterns:

  1. real effects - real relationships between attributes and response
  2. random effects - random patterns that merely look like real effects

"Fitting" matches both:

  • real effects: the same in all data sets
  • random effects: different in every data set
  • BUT: only real effects are duplicated in other data.

How to measure a model's performance: split the data into

  • a larger set to fit the model
  • a smaller set to measure the model's effectiveness

Train/Validate/Test to choose best model

What if we want to compare 5 runs of SVM and KNN?


Problems:

  • Observed performance = real quality + random effects. High-performing models are more likely to have above-average random effects.
  • Just choosing best performing model is too optimistic.

Solution:

  • Use both validation and test data set.
  • Training: build the models; Validation: pick the best one; Test: estimate its real performance.


Splitting Data - Random/Rotation

How much data to use?

  • Typically 70-90% training, 10-30% test.
  • BUT to compare models: 50-70% train, split the rest equally between validation and test.

How to split data?

  1. Random: Randomly sample split (60:20:20) without replacement.
  2. Rotation: Take turns selecting points (e.g. 5 data point rotation sequence). Benefit: Equally separates data. Cons: May introduce bias, if rotations match a pattern (e.g. workdays)

Splitting Data - k-Fold Cross Validation

Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.

  • Pros: Better use of the data (trains across all available data), better estimate of model quality and more effective way to choose a model.
  • For 4-fold: train on 3 parts (A, B, C) and validate on the remaining 1 (D), then rotate so each part takes a turn as the validation set. Each part is used k-1 times for training and once for validation.
  • Finally, average the k evaluations to estimate the model's quality.
  • Most common k is 10.
  • Important: Don't use the resulting models from k-fold cross validation, either by choosing one of the trained models or by averaging the coefficients across the k splits. Train the model again using all the data.
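The fold-rotation idea can be sketched in Python (contiguous folds for simplicity; in practice you'd shuffle the indices first):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold takes one turn as the validation set."""
    folds = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    for i in range(k):
        val = folds[i]                                      # one part held out
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]  # the other k-1 parts
        yield train, val

# 8 points, 4 folds: each point is validated exactly once and trained on k-1 = 3 times.
for train, val in kfold_indices(8, 4):
    print("validate on", val, "| train on", train)
```

Each iteration of the loop corresponds to one rotation in the diagram described above.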


Module 4 - Clustering

Definition: Grouping data points

Use:

  • Targeted marketing / market segmentation (market on size / price / versatility / aesthetics)
  • Personalized medicine
  • Locating facilities
  • Image analysis (face recognition, captcha)
  • Data investigation

Distance Norms

Euclidean (straight-line) distance
  • 2-dimension
$$ Distance = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2} $$
Rectilinear distance
  • 2-dimension
  • Used for calculating distance in a grid.
  • Also called "Manhattan distance"
$$ Distance = |x_1 - y_1| + |x_2 - y_2| $$
p-norm Distance (Minkowski Distance)
  • n-dimension
  • Sum over all n dimensions
$$ Distance = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p} $$
Infinity Norms
  • Definition: "The infinity norm simply measures how large the vector is by the magnitude of its largest entry." (reference)
  • It is essentially the p-norm distance with p set to infinity.
$$ Distance = \sqrt[\infty]{\sum_{i=1}^{n}|x_i - y_i|^{\infty}} = \sqrt[\infty]{\max_i|x_i - y_i|^{\infty}} = \max_i|x_i - y_i| $$

Why use infinity norms?

  • Robotics example: Automated storage and retrieval system. Time it takes to store/retrieve depends on moving to the aisle (horizontal) and arm height (vertical).
  • A "one-norm" would be the robot moving to the aisle, then lifting its arm to the correct height.
  • An "infinity-norm" would be the robot moving to the aisle and lifting its arm simultaneously; the total time is dictated by whichever movement takes longer.
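The norms above can be sketched in Python (toy points, chosen so the 1-norm, 2-norm, and infinity-norm are easy to check by hand):

```python
def p_norm(x, y, p):
    """Minkowski (p-norm) distance; p=1 is rectilinear, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def inf_norm(x, y):
    """Infinity norm: the single largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 2.0), (4.0, 6.0)       # coordinate differences: 3 and 4
print(p_norm(x, y, 1))  # 7.0 (|3| + |4|, rectilinear)
print(p_norm(x, y, 2))  # 5.0 (3-4-5 triangle, Euclidean)
print(inf_norm(x, y))   # 4.0 (max of 3 and 4)
```

As p grows, `p_norm` approaches `inf_norm` — the largest term dominates the sum.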

K-Means Algorithm

How it works:

  1. Pick k cluster centers within range of data
  2. Assign each data point to nearest cluster center
  3. The initial centers are not the actual "centers" of the clusters. Recalculate each cluster's centroid (the mean of its points); these centroids become the new cluster centers.
  4. After the centroids are set, some data points might no longer be in the right cluster (determined by closeness to each centroid). Keep repeating steps 2 and 3 until no data point changes clusters.

Characteristics:

  • k-means is a heuristic: fast and good, but not guaranteed to find the absolute best solution.
  • It is an expectation-maximization (EM) style algorithm: the "expectation" step assigns each data point to its closest cluster center, and the "maximization" step moves each center to minimize total distance to its points (i.e. maximize the negative of the distance).
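The assign/recompute loop described above can be sketched in a few lines of Python (toy 2-D data; real implementations add multiple randomized restarts, as discussed below):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means sketch: assign points to the nearest center,
    recompute centroids, repeat until no assignment changes."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 2: assign to nearest center
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        new = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]   # step 3: recompute centroids
        if new == centers:                   # step 4: stop when nothing moves
            break
        centers = new
    return centers

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(sorted(kmeans(pts, 2)))                # centroids of the two obvious blobs
```

On this toy data the loop converges to the centroids of the two blobs regardless of which two points are drawn as initial centers.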
K-means in practice
  1. Outliers: k-means assigns an outlier to whichever cluster is closest. The easiest fix is to remove those outliers, but best practice is to understand why they are there.
  2. It's fast: Take advantage of its runtime by running several iterations.
    • For each run, choose a different initial cluster center (randomized initialization), then choose the best solution found.
    • Test different values of k as well.
  3. Understanding k for optimization.
    • Make sure k "fits the situation you're analyzing". k = # of data points may be the most theoretically optimal, but does that actually make sense for the task?
    • Solution: Plot k vs. total distance to find the "elbow" of the curve. Beyond a certain k, the benefit of adding another cluster becomes negligible.
K-means for predictive clustering

Classification task: Given a new data point, determine which cluster set the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.

Another classification task: What range of possible data points would we assign to each cluster?

(image: cluster regions, aka a "Voronoi diagram")

Classification / Clustering - DifferencesΒΆ

  • The difference is what we know about the data points.
  • In classification, we know the correct class of each data point. This is called supervised learning.
  • In clustering, we don't know the correct classes. This is called unsupervised learning.


Week 3

Module 5 - Basic Data Preparation

Data Preparation: Quantitative Examples

  • credit scores
  • average daily temperature
  • stock price

Things to watch out for before building models:

  • Scale of the data: Can throw off models. Beware of possible outliers.
  • Extraneous information: Complicates the model, harder to correctly interpret the solution.

Outlier Detection

Definition: Data point that's very different from the rest.

Types of outliers
  • point outliers: values far from the rest of the data.
    • Example: Outlier point in a cluster scatterplot
  • contextual outlier: value isn't far from the rest overall, but is far from points nearby in time.
    • Example: Time series data, low temperature at an unexpected time.
  • collective outlier: something is missing in a range of points, but we can't tell exactly where.
    • Example: a missing heartbeat in a time-series chart of ECG data.
How to detect outliers
1. Box-and-whisker plot
  • helps find outliers (for 1-dimensional data)
  • the box spans the 25th to 75th percentiles of the values
  • the line in the middle of the box is the median
  • the "whiskers" extending outside the box mark the reasonable range of values expected of the data (e.g. the 10th/90th or 5th/95th percentiles)
  • data points outside the whiskers could be considered outliers


2. Modeling error
  • Idea: Fit a model, then find points with high error
  • Example: Fit an exponential smoothing model to temperature time-series data.
  • Points with very large error might be outliers: the model expects a smooth curve, but the actual data point is far from it.


Dealing with Outliers

Depends on the data:

  • Bad data (failed sensor, bad experiment, bad input)... or maybe real data?
  • Need to root cause - possible avenues: find data source, understand how it was compiled.
  • With large data sets, outliers are expected.
    • In normally-distributed data, roughly 5% of points fall outside 2 standard deviations.
    • With 1,000,000 data points, >2,000 are expected outside 3 standard deviations.
    • Removing them all can make the model too optimistic; they account for real noise.

Actions:

  • If the data is really bad, remove it and use data imputation
  • If real/correct data, think whether the data you're modeling could realistically have outlier data points (e.g. force majeure incidents causing delays/cancellations).
  • Modeling outliers: Using logistic regression model, estimate probability of outliers happening under different conditions. Then, use a second model to estimate length of delivery under normal conditions, using data without outliers.

Module 6 - Change Detection

Definition: Determining whether something has changed.

Why:

  • Determine if action is needed (preventative maintenance)
  • Determine impact of past action (did discount increase sales?)
  • Determine changes to help planning (did voting patterns change?)
  • Hypothesis tests often not sufficient for change detection, because they are slow to detect change.

Cumulative sum (CUSUM) approach

Definition: Detect an increase and/or decrease by keeping a cumulative sum of deviations: "Has the cumulative deviation from the expected mean gone beyond a critical level?"

How:

  • The basic idea is to keep a running total of how far each observation deviates from the expected mean across the time steps.
  • If the running total is greater than zero we keep it; otherwise it resets to zero. This allows changes to be detected quickly.
  • The running total ($S_t$) is then compared against a threshold $T$ to decide whether a change has occurred.
  • $C$ is a buffer that pulls the running total down, to prevent too many false positives.

Terms:

  • $S_t$ = running total (the CUSUM statistic) at time $t$
  • $X_t$ = observed value at time $t$
  • $\mu$ = mean of $X$, if no change
  • $C$ = buffer value
  • $T$ = threshold value
  • $X_t - \mu$ = how much above expected the observation is at time $t$
  • $\mu - X_t$ = how much below expected the observation is at time $t$
Formula and Example

Formula to detect an increase (change detected if $S_t \ge T$):

$$ S_t = \max\left\{0, S_{t-1} + (X_t-\mu-C) \right\} $$

Formula to detect a decrease (change detected if $S_t \ge T$); $\mu$ and $X_t$ are flipped:

$$ S_t = \max\left\{0, S_{t-1} + (\mu-X_t-C) \right\} $$

Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time and if it ever gets beyond this threshold line, it shows that CUSUM detected a change.
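The increase-detection formula translates directly into Python. A minimal sketch (the data and the $\mu$, $C$, $T$ values are made up for illustration):

```python
def cusum(xs, mu, C, T):
    """Return the first index where S_t = max(0, S_{t-1} + (x_t - mu - C))
    crosses the threshold T (increase detection), or None if it never does."""
    S = 0.0
    for t, x in enumerate(xs):
        S = max(0.0, S + (x - mu - C))   # reset to zero when the total goes negative
        if S >= T:
            return t
    return None

# Mean shifts from ~10 to ~13 halfway through (made-up data):
xs = [10, 10.5, 9.8, 10.2, 9.9, 13.1, 12.8, 13.2, 13.0]
print(cusum(xs, mu=10, C=1, T=5))  # change detected at index 7
```

Before the shift, the buffer $C$ keeps $S_t$ pinned at zero; after it, the excess over $\mu + C$ accumulates and crosses $T$ within a few observations. Swapping `(x - mu - C)` for `(mu - x - C)` gives the decrease detector.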

Week 4

Module 7 - Time Series Models

Exponential Smoothing

Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.

Example:

  • $S_t$ - expected baseline response at time period $t$
  • $X_t$ - observed response at $t$
  • What is real blood pressure over time without variation?
  • Different blood pressure, increase in baseline, or random event?
  • Ways to answer:
    • $S_t = X_t$ - observed blood pressure is real indicator
    • $S_t = S_{t-1}$ - today's baseline is the same as yesterday's baseline

Exponential Smoothing Model

  • Sometimes called single, double, or triple exponential smoothing, depending on how many aspects are considered (trend, seasonality).
  • Triple exponential smoothing is also called Winters' method or Holt-Winters.
Formula
$$ S_t = \alpha x_t + (1 - \alpha)S_{t-1}, \quad 0 < \alpha < 1 $$
  • $\alpha$ trades off between trusting $x_t$ when $\alpha$ is large, and trusting $S_{t-1}$ when $\alpha$ is small.
  • $\alpha$ -> 0: A lot of randomness in the system.
    • Trust previous estimate $S_{t-1}$
  • $\alpha$ -> 1: Not much randomness in the system.
    • Trust what you see $x_t$.
  • How to start? Set the initial condition: $S_1 = x_1$
  • Does not deal with trends or cyclical variations - more on that later.
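A minimal Python sketch of the basic smoothing recursion, showing how $\alpha$ trades off trusting the observation vs. the previous baseline (the data is made up):

```python
def exp_smooth(xs, alpha):
    """Basic exponential smoothing: S_1 = x_1, then S_t = alpha*x_t + (1-alpha)*S_{t-1}."""
    S = [xs[0]]                                  # initial condition S_1 = x_1
    for x in xs[1:]:
        S.append(alpha * x + (1 - alpha) * S[-1])
    return S

xs = [120, 122, 150, 121, 119]        # one spiky observation at t=3
print(exp_smooth(xs, alpha=0.2))      # low alpha: the 150 spike barely moves the baseline
print(exp_smooth(xs, alpha=0.9))      # high alpha: the baseline tracks the spike closely
```

Comparing the two runs makes the tradeoff concrete: low $\alpha$ attributes the spike to randomness, high $\alpha$ treats it as a real change.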

Time series complexities

  • Trends (increase/decrease)
  • Cyclical patterns (temp/sales/biometric cycles)

Trends

Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$

Trend at time $t$ is based on the difference between consecutive baseline estimates $S_t$ and $S_{t-1}$, smoothed with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$

  • Initial condition is $T_1 = 0$

Cyclic

Two ways to calculate:

  1. Like trend - additive component of formula, or
  2. Seasonalities: multiplicative way where:
    • $L$: Length of a cycle
      • e.g. $L=24$ for hourly biometric readings over a day
    • $C_t$: Multiplicative seasonality factor for time $t$.
      • inflates/deflates the observation

Baseline formula (including trend + seasonality)

$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$

Update the seasonal (cyclic) factor in a similar way as trends:

$$C_t = \gamma(\frac{x_t}{S_t}) + (1 - \gamma)C_{t-L}$$
  • No initial cyclic effect: $C_1$, ..., $C_L$ = 1

Example: Sales trends

  • If $C$ is 1.1 on Sunday, then sales were 10% higher just because it was Sunday
  • Of 550 sold on Sunday, 500 is baseline, 50 is 10% extra

Starting Conditions

For trend:

  • $T_1 = 0$: No initial trend

For multiplicative seasonality

  • Multiplying by 1: shows no initial cyclic effect
  • First $L$ values of $C$ set to 1.

Exponential Smoothing: What the Name Means

"Smoothing"

img

  • The equation smoothes high and low spikes using $\alpha$.
    • If we have high observed value $x_t$, the baseline estimate $S_t$ is not as high. High value is only weighted by $\alpha$ which is less than one, and pulled down from high point by previous baseline $S_{t-1}$.
    • If we have low observed value, $(1-\alpha)S_{t-1}$ term pulls the estimate up from the low observed value.


"Exponential"

Each time period's estimate can be plugged into the next, expanding the recursion over past observations:

$$ S_t = \alpha x_t + \alpha(1-\alpha)x_{t-1} + \alpha(1-\alpha)^2 x_{t-2} + \ldots + (1-\alpha)^{t-1}S_1 $$

  • Each past time period's observation is weighted by $(1-\alpha)$ raised to an increasing exponent.
  • Significance: every past observation contributes to the current baseline estimate $S_t$.
  • More recent observations are weighted more heavily, i.e. they are more important.

Applications in Forecasting

Given basic exponential smoothing equation

$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$

We want to make a prediction $S_{t+1}$. Since $X_{t+1}$ is unknown, replace it with $S_t$.

Using $S_t$, the forecast for time period $t+1$ is

$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$

hence, our estimate is the same as our latest baseline estimate

$$F_{t+1} = S_t$$

Factoring in trend/cycle

The above equation can be extrapolated to the trend and cycle calculations.

Best estimate of trend is the most current trend estimate:

$$F_{t+1} = S_t + T_t$$

Same for cycle (multiplicative seasonality)

$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$

Where $F_{t+k} = (S_t + kT_t)\,C_{t+k-L}$, for $k = 1, 2, \ldots$

AutoRegressive Integrated Moving Average (ARIMA)

3 key parts

1. Differences

  • Exponential smoothing assumes data is stationary, but reality is that it is not.
  • However, the difference might be stationary.
  • For example:

    • First-order difference $D_{(1)}$: difference of consecutive observations

    $$D_{(1)t} = (x_t - x_{t-1})$$

    • Second-order difference $D_{(2)}$: differences of the differences

    $$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$

    • Third-order difference $D_{(3)}$: differences of the differences of the differences

    $$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$

    • In general, a d-th order difference applies differencing d times.
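Differencing is easy to sketch in Python; a quadratic trend becomes constant (i.e. stationary) after two differences:

```python
def difference(xs, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    for _ in range(d):
        xs = [b - a for a, b in zip(xs, xs[1:])]
    return xs

xs = [1, 4, 9, 16, 25]        # quadratic trend: not stationary
print(difference(xs, 1))      # [3, 5, 7, 9]  still trending
print(difference(xs, 2))      # [2, 2, 2]     constant -> stationary
```

Note each difference shortens the series by one point, which is part of why ARIMA needs a reasonable amount of history.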

2. Autoregression

Definition: Predicting the current value based on previous time periods' values.

Exponential smoothing viewed as autoregression:

  • an order-$\infty$ autoregressive model
  • uses data all the way back.

Order-p autoregressive model:

  • Go back only p time periods.

"ARIMA" combines autoregression and differencing

  • autoregression on the differences
  • use p time periods of previous observations to predict d-th order differences.

3. Moving Average

  • Uses previous errors $\epsilon_t$ as predictors $$\epsilon_t = (\hat{x}_t - x_t)$$
  • Apply order-q moving average (go back q time periods) $$\epsilon_{t-1}, \epsilon_{t-2}, \ldots , \epsilon_{t-q}$$

ARIMA (p,d,q) model

$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)t-i} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$

Choose:

  • p-th order autoregression
  • d-th order difference
  • q-th order moving average
  • Once p, d, q are chosen, statistical software can find the model's coefficients through optimization.

Other flavors of ARIMA

  • Add seasonal values of p, d, q
  • Set specific values of p, d, q to get certain qualities
    • ARIMA(0,0,0) = white noise
    • ARIMA(0,1,0) = random walk
  • Can generalize a lot of simpler models:
    • ARIMA(p,0,0) = AR (autoregressive) model
    • ARIMA(0,0,q) = MA (moving average) model
    • ARIMA(0,1,1) = basic exponential smoothing model
  • Short-term forecasting
    • Usually better than exponential smoothing if:
    • Data is more stable, with fewer peaks/valleys/outliers.
  • Need ~40 past data points for ARIMA to work well.

Generalized Autoregressive Conditional Heteroskedasticity (GARCH)

Definition: Estimate or forecast the variance of something, given a time-series data.

Motivation:

  • Estimate the amount of error.
  • Example: Forecast demand for pickup trucks
    • Variance tells how much forecast might be higher/lower than true value
  • Example: Traditional portfolio optimization model.
    • Balances the expected return of a portfolio with the amount of volatility. (Riskier: higher expected return, and vice-versa.)
    • Variance is a proxy for the amount of volatility/risk.

Mathematical Model:

$$ \sigma_t^2 = \omega + \sum_{i=1}^{p}\beta_i\sigma_{t-i}^2 + \sum_{i=1}^{q} \gamma_i\epsilon_{t-i}^2 $$
  • Equation is similar to ARIMA, with 2 differences
    1. Uses variance/squared errors, not observations/linear errors.
    2. Uses raw variances, not differences of variances.
  • Stat software can fit the model, given p and q (don't need d since GARCH doesn't use differences)

SUMMARY

  1. Exponential smoothing
  2. ARIMA - generalized version of exponential smoothing
  3. GARCH - ARIMA-like model for analyzing variance

Week 5

Basic Regression

What it explains:

  • Causation / Correlation
    • Value of a home run
    • Effect of economic factors on presidential election
  • Prediction
    • Height of child at adulthood
    • Future oil price
    • Housing demand in next 6 months

Simple Linear Regression (SLR)

Definition: Linear regression with one predictor

  • Looks for linear relationship between predictor and response.


Sum of Squared Error

Example:

  • $y_i$ = cars sold for data point i
  • $\hat{y}_i$ = model's prediction of cars sold
  • Data point i's prediction error: $y_i - \hat{y}_i = y_i - (a_0 + a_1 x_{i1})$

Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (a_0 + a_1 x_{i1}))^2 $$

Best-fit regression line

  • The line moves around to minimize the sum of squared errors; as it moves, the values of the coefficients $a_0$ and $a_1$ change.
  • The best-fit line is defined by $a_0$ and $a_1$.

Maximum likelihood (or how to measure the quality of a model's fit)

  • Likelihood: Measure the probability (density) for any parameter set.
  • Maximum likelihood: Parameters that give the highest probability.

How it works:

  • Assume that the observed data is the correct value and we have information about the variance.
  • Then for any set of params we can measure the probability (density) that the model would generate the estimates it does.
  • Whichever set of params gives the highest probability density (i.e. the maximum likelihood) is the best-fit set of params.

Maximum Likelihood - Example

  • Assumption: the distribution of errors is normal with mean zero and variance $\sigma^2$, independently and identically distributed.
  • Observations: $z_1$, ..., $z_n$
  • Model estimates: $y_1$, ..., $y_n$
  • Maximum-likelihood estimation: the set of parameters that minimizes the sum of squared errors


MLE in Linear Regression

  • $x_{ij}$ is the jth observed predictor value of data point $i$.
  • $a_0$ through $a_m$ are the params we're trying to fit.
  • $a_0$ through $a_m$ are the params we're trying to fit.

Maximum Likelihood Fitting

  • Simple example: regression with independent normally distributed errors
  • But can get complex with:
    • different estimation formulas
    • different assumptions about the error
  • Stat software can handle most of these complexities.

Models that Combine Maximum Likelihood with Model Complexity

Akaike Information Criterion (AIC)

  • $L^*$: Maximum likelihood value
  • $k$: number of parameters being estimated
$$ AIC = 2k - 2\ln{(L^*)} $$
  • $2k$ is the penalty term. Balances model's likelihood with its simplicity. This is done to reduce the chance of overfitting by adding too many parameters.
  • The model with the smallest AIC is preferred.

AIC applied to regression

  • $m+1$: number of parameters
  • In place of $L^*$ (max likelihood value): substitute the linear regression's maximum likelihood (for normal errors, a function of the sum of squared errors).


Corrected AIC ($AIC_c$)

  • AIC weakness: its guarantees only hold well when there are infinitely many data points.
  • Corrected AIC accounts for finite samples by adding a correction term.

Equation:

  • $n$: number of data points
$$ AIC_c = AIC + \frac{2k(k+1)}{n-k-1} = 2k - 2\ln{(L^*)} + \frac{2k(k+1)}{n-k-1} $$

Comparing AIC / AICc models by Relative Probability

Example:

  • Model 1: AIC = 75
  • Model 2: AIC = 80

Relative likelihood $$ e^\frac{(AIC_1 - AIC_2)}{2} $$

Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} = 8.2\% $$

Result:

  • Model 2 is 8.2% as likely as Model 1 to be the better model.
  • Since lower AIC is better, Model 1 is the better choice.
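The relative-likelihood calculation above can be checked in a couple of lines of Python:

```python
import math

def relative_likelihood(aic_best, aic_other):
    """exp((AIC_best - AIC_other)/2): how likely the higher-AIC model is
    to be as good as the lower-AIC one."""
    return math.exp((aic_best - aic_other) / 2)

# Model 1: AIC = 75, Model 2: AIC = 80
print(round(relative_likelihood(75, 80), 3))  # 0.082 -> Model 2 is ~8.2% as likely
```

Note the difference of 5 AIC units already makes Model 2 a clear underdog; differences near 0 would make the two models roughly interchangeable.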

Bayesian Information Criterion (BIC)

  • $L^*$: Maximum likelihood value
  • $k$: Number of parameters being estimated
  • $n$: Number of data points
$$ BIC = k \ln{(n)} - 2 \ln{(L^*)} $$

Characteristics:

  • Similar to AIC
  • BIC's penalty term is larger than AIC's (since $\ln{(n)} > 2$ once $n > 7$).
  • It therefore encourages models with fewer parameters than AIC does.
  • Only use BIC when there are more data points than parameters.

BIC Metrics - Rule of thumb

(image: rule-of-thumb thresholds for interpreting BIC differences between models)

Summary of AIC / BIC

  • AIC = Frequentist, BIC = Bayesian
  • No hard-and-fast rule for using AIC or BIC or maximum likelihood.
  • Looking at all three can help you decide which is best.

Understanding Regression Coefficients

Baseball Example: Determine average number of runs a home run is worth.

  • Response: How many runs a team scored that season
  • Predictors: # of home runs, triples, doubles, singles, outs, etc.

Equation:

$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$

Applications of LR:

  1. Descriptive Analytics
  2. Predictive Analytics
  3. ~Prescriptive Analytics~ (Not used for this)

Causation vs Correlation

Causation: One thing causes another thing.
Correlation: Two things tend to happen (or not happen) together; neither of them might cause the other.

Application in Linear Regression:

  • x: city's avg daily winter temp
  • y: hours/days spent outdoors in winter
  • x may have causation over y (warmer avg temp -> more time outdoors in winter)
  • y does not have causation over x (more time outdoors in winter -/-> warmer avg temp)

How to decide if there is causation?

  • Cause is before effect.
  • Idea of causation makes sense (subjective)
  • No outside factors causing the relationship (hard to guarantee)

Transforming Data for Generalized LR Modeling

  • Problem: Ordinary linear regression won't fit data whose relationship is nonlinear (curved).
  • Solution: Transform the data

Method 1: We can adjust the data so the fit is linear.

  • Quadratic Regression
$$ y = a_0 + a_1x_1 + a_2x_1^2 $$

Method 2: We can transform the response

  • Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$

  • Box-Cox transformations (link) can be automated in statistical software.

When to Transform Data

Example: Variable interaction

  • Ex: Estimating 2-yr old's adult height.
  • Predictors:
    • Father's height
    • Mother's height
    • Product of the parents' heights ($x_1x_2$)
  • We can treat the third predictor as a new input $x_3$
$$ y = a_0 + a_1x_1 + a_2x_2 + a_3(x_1x_2) $$

How to Interpret Output

1: P-Values

  • Estimates the probability that coefficient might be zero (ie. type of hypothesis testing).
  • Rule of thumb: If coefficient's p-value is > 0.05, remove the attribute from the model.
  • Other thresholds besides 0.05 can be used:
    • Higher: More factors included
      • Con: May include irrelevant factors
    • Lower: Fewer factors included
      • Con: May leave out relevant factors.

Warnings About P-Value

  • With large # of data, p-values get small even when attributes are not related to response.
  • P-values are only probabilities, even when a result seems meaningful.
    • Example: 100 attributes, each with a 0.02 p-value.
    • Each has a 2% chance of not being significant.
    • So expect about 2 of them to really be irrelevant.
  • Some things to think about (from office hours):
    • Does the effect size seem sensible to support such a small p-value, or are the p-values shrunk merely by sample size?
    • What happens to the p-value of a given attribute if others are removed from the regression model?
    • What does the known relationship between response and covariate tell you about this p-value? Is it likely a spurious but strong correlation, or a genuine effect?

2: Confidence Interval (CI)

  • Where the coefficient probably lies
  • How close that is to zero.

3: T-Statistic

  • The coefficient divided by its standard error.
  • Related to p-value.

4: Coefficient

  • Check whether the coefficient makes a practical difference, even if its p-value is low.
  • Example: Estimating household income with age as an attribute.
    • If coefficient of age is 1, even with low p-value, the attribute isn't important.

(A bit more explanation on contextualizing the coefficient of 1.0 in this context explained in Piazza):

Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.

So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.

5: R-Squared Value

  • Estimate of how much variability your model accounts for.
  • Example:
    • R-square = 59%
    • This means the model accounts for ~59% of the variability in the data.
    • Remaining 41% is randomness or other factors.

Adjusted R-squared: Adjusts for the number of attributes used.

Warnings about R-squared:

  • Some things are not easily modeled, especially when trying to model real-life systems.
  • In these cases, even a R-squared of 0.3-0.4 is quite good.

From Office Hours: W5 HW Review

Key Assumptions of Linear Regression

  1. Linearity - there needs to exist a linear relationship between X and Y
  2. Independence - (no autocorrelation) each point is independent of the others. Time series violates this.
  3. Normality - residuals (differences between predicted and actual values) need to be normal. Don't want to see a pattern in the residuals.
  4. Constant variance (homoscedasticity) - residuals need to have constant variance.
  • P-value (in linear regression) is the probability that the coefficient's value is actually zero. A higher p-value means a greater probability.
  • Multiple R-squared: Percentage of variance explained by your model (0-1)
  • Adjusted R-squared: Accounts for the number of predictors used; extra predictors lower it, reducing the bias of using too many predictors.
  • A big difference between multiple and adjusted R-squared means there are probably too many predictors.

How to calculate R-squared values directly from cross validation

# R-squared = 1 - SSEresiduals/SSEtotal

# total sum of squared differences between data and its mean
SStat <- sum((dat$Crime - mean(dat$Crime))^2)

# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2) #model 1
SSres_model2 <- sum(model2$residuals^2)
SSres_c <- attr(c, "ms")*nrow(dat) # MSE, times number of data points, gives sum of squared error

# Calculate R-squared
1 - SSres_model/SStat # initial model with insignificant predictors
# 0.803

1 - SSres_model2/SStat # model2 without insignificant predictors (based on p-value)
# 0.766

1 - SSres_c/SStat # cross-validated
# 0.638

# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising since we only have 47 data points and 15 predictors (about a 3:1 ratio)
# Good to have 10:1 or more.

HW 6 Preview - PCAΒΆ

Q0.1 - Using the crime dataset, apply PCA, then create a regression model using the first few principal components.

  • specify new model in terms of original variables (not principal)
  • compare the quality of the PCA model to that of your solution to Q8.2 (linear regression with no PCA)
  • You can use R function prcomp for PCA (scale data first scale=TRUE)
  • Don't forget to unscale the coefficients to make a prediction for the new city (ie. do the scaling calculation in reverse).

  • Eigenvalues correspond to the amount of variance explained by each principal component.

  • Eigenvectors correspond to the direction of the new axes.
    • Can find through eigendecomposition, or SVD of data covariance matrix.
    • Data is then projected onto these axes (matmult) to get new features.
    • prcomp does all of this under the hood
    • Still need to scale.
  • PCA will return n principal components. Usually you only want to use the first several; using all of them doesn't make sense because you're not taking advantage of the reduced dimensionality.

How to decide which PC is important

  • prcomp will rank them for us.
  • 1st will have highest eigenvalue, and so forth.
  • You can plot cumulative explained variance alongside individual explained variance (see plot)

img
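The ranking and the explained-variance plot come straight from the eigenvalues. A minimal Python sketch of the calculation, with made-up eigenvalues:

```python
# hypothetical eigenvalues of the scaled data's covariance matrix, sorted descending
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

total = sum(eigenvalues)

# fraction of variance each PC explains, and the running total
individual = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for frac in individual:
    running += frac
    cumulative.append(running)

print(individual)  # first PC explains 4/8 = 50% of the variance
print(cumulative)  # first two PCs together explain 75%, etc.
```

The "elbow" in the cumulative curve is a common cutoff for how many PCs to keep.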

Week 6 - Advanced Data PreparationΒΆ

Box-Cox TransformationsΒΆ

Why: Normality assumption

  • Some models assume data is normally distributed. Bias occurs when this assumption is wrong.
  • Left chart is normal, right chart shows heteroscedasticity (unequal variance)
    • (If applying regression model to right chart) Higher variance at upper end can make those estimation errors larger, and make model fit those points better than others.
  • We can account for this via several methods.

img

Box-Cox Transformation

  • What: Transforms a response to eliminate heteroscedasticity.
  • How: Applies a power (log-like) transformation that:
    1. stretches the smaller range to increase its variability.
    2. shrinks the larger range to reduce its variability.

Box-Cox Formula:

  • $y$: Vector of responses
  • $t(y)$: Transformed vector
  • $\lambda$: Param to adjust
  • Goal: Choose $\lambda$ so that $t(y)$ is approximately normally distributed
$$t(y) = \frac{y^\lambda - 1}{\lambda}$$

NOTE: First check whether you need the transformation (e.g. QQ plot)
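The formula translates directly into code; the $\lambda \to 0$ limit is $\log y$, handled as a special case. A minimal sketch:

```python
import math

def box_cox(y, lam):
    """Box-Cox transform t(y) = (y^lam - 1) / lam; log(y) when lam == 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

# lam = 0.5 is a shifted/scaled square root: large values get compressed
y = [1.0, 4.0, 9.0, 16.0]
print(box_cox(y, 0.5))  # [0.0, 2.0, 4.0, 6.0]
```

In practice $\lambda$ is chosen by maximum likelihood (e.g. R's `boxcox` in MASS does this), not by hand.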

DetrendingΒΆ

Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.

  • Example: Price of gold over time, raw price versus inflation-adjusted price.
  • If fitting regression model based on other factors with data as-is, it will return a very different response using same factor values depending on the decade.
  • Adjusting inflation will return a closer value regardless of time.

When: Whenever using factor-based model (regression, SVM) to analyze time-series data*

("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)

On What:

  • Response
  • Predictors (time series data)

How:

  • Factor-by-factor
    • 1D regression: $y = a_0 + a_1x$
    • Ex. Simple linear regression of gold prices ($Price = -45,600 + 23.2 * Year$)
    • Detrended: $Actual price - (-45,600 + 23.2 * Year)$
  • Example charts: Raw (left), Inflation-adjusted (center), Detrended via linear regression (right)
    • Center and right a bit different bc linear regression accounts for different inflation rates each year.

img
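The factor-by-factor detrending above is just "fit a 1-D regression, subtract the fit." A stdlib Python sketch with made-up yearly prices:

```python
years = [2000, 2001, 2002, 2003, 2004]
price = [270.0, 285.0, 300.0, 330.0, 345.0]  # hypothetical prices with an upward trend

n = len(years)
mx = sum(years) / n
my = sum(price) / n

# closed-form least-squares slope and intercept for y = a0 + a1 * x
a1 = sum((x - mx) * (y - my) for x, y in zip(years, price)) / sum((x - mx) ** 2 for x in years)
a0 = my - a1 * mx

# detrended value = actual price - (a0 + a1 * year)
detrended = [y - (a0 + a1 * x) for x, y in zip(years, price)]
print(a1, detrended)  # 19.5 [3.0, -1.5, -6.0, 4.5, 0.0]
```

The detrended residuals sum to zero by construction, and what's left is the variation not explained by the time trend.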

Intro to Principal Component Analysis (PCA)ΒΆ

Why:

  1. We may have too many factors, difficult deciding which ones are important.
    • Ex. Endless number of features to choose from to predict stocks, how to reduce this dimension.
  2. There may be high correlation between some of the predictors. Leads to redundant predictors.

What:

  • PCA transforms data by:
    1. Removing correlations within the data.
    2. Ranking coordinates by importance.
  • Gives you n principal components:
    • Concentrate on the first few PCs.
    • Pro: Reduce effect of randomness.
    • Pro: Earlier PCs likely to have higher SNR (signal-to-noise ratio).
  • Plot of PCA (chart) in 2 PCs:
    • D_1 becomes new x-coordinate.
    • D_2 becomes new y-coordinate.
    • The new coordinates are uncorrelated; the new axes are orthogonal.
    • PCA ranks PCs by amount of spread in value (e.g. D_1 has more spread than D_2, so it becomes PC_1)

img

How PCA Works (Math)ΒΆ

$X$: Initial matrix of data (scaled)

  • $i$: data point
  • $j$: factor
  • $X_{ij}$: $j$-th factor of data point $i$
  • Scaled such that $\frac{1}{m} \Sigma_i x_{ij} = \mu_j = 0$
    • For each factor $j$, the average value of $x_{ij}$ over all data points is shifted to be zero.

Find all of the eigenvectors of $X^TX$

  • $V$: Matrix of eigenvectors (sorted by eigenvalue)
  • $V = [V_1, V_2 \ldots ]$, where $V_j$ is the $j$-th eigenvector of $X^TX$

PCA:

  • Each principal component is a transformation of the scaled data matrix ($X$) and eigenvector matrix ($V$) (ie. $XV_n$)
  • $XV_1$ = first component, $XV_2$ = second component
  • Formula for $t_{ik}$, ie. the $k$-th new factor value for the $i$-th data point:
$$ t_{ik} = \Sigma^m_{j=1}x_{ij}v_{jk} $$

How to use beyond linear transformation

  • Use kernels for nonlinear functions.
  • Idea is similar to SVM modeling.
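A from-scratch sketch of the linear 2-D case (toy, zero-mean data; for a symmetric 2x2 matrix the first eigenvector lies at the angle computed below), checking that the projected coordinates $t = XV$ come out uncorrelated:

```python
import math

# toy 2-D data, already scaled so each column has mean zero (made-up numbers)
X = [(-2.0, -1.0), (-1.0, -0.5), (0.0, 0.5), (1.0, 0.0), (2.0, 1.0)]

# entries of the symmetric matrix X^T X
sxx = sum(x * x for x, _ in X)
syy = sum(y * y for _, y in X)
sxy = sum(x * y for x, y in X)

# the first eigenvector of a symmetric 2x2 matrix lies at this angle
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
v1 = (math.cos(theta), math.sin(theta))
v2 = (-math.sin(theta), math.cos(theta))  # orthogonal second eigenvector

# project each data point: t_ik = sum_j x_ij * v_jk
T = [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1]) for x, y in X]

# the new coordinates are uncorrelated: the cross-product sum is ~0
cross = sum(t1 * t2 for t1, t2 in T)
print(abs(round(cross, 12)))  # 0.0
```

`prcomp` in R does exactly this (via SVD) for any number of dimensions; the sketch is just to show the projection formula in action.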

How to interpret the model? (Regression)ΒΆ

Problem: How do we get the regression coefficients for the original factors instead of PCs?

Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.

$$ \begin{aligned} y_i &= b_0 + \Sigma^L_{k=1} b_k t_{ik} \\ &= b_0 + \Sigma^L_{k=1} b_k [ \Sigma^m_{j=1} x_{ij} v_{jk} ] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [\Sigma^L_{k=1} b_k v_{jk}] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [a_j] \end{aligned} $$

Implied regression coefficient for $x_j$: $$ a_j = \Sigma^L_{k=1} b_k v_{jk} $$

  • Each original attribute’s implied regression coefficient is equal to a linear combination of the principal components’ regression coefficients.
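A numeric sanity check of that identity, with a made-up eigenvector matrix $V$ and made-up PC coefficients $b_k$: folding $V$ into the PC coefficients gives the same prediction either way.

```python
# hypothetical eigenvector matrix V (columns are eigenvectors) and PC regression fit
V = [[0.8, -0.6],
     [0.6,  0.8]]           # V[j][k] = v_jk
b0, b = 5.0, [2.0, 1.0]     # intercept and PC coefficients b_k

# implied coefficient for original factor j: a_j = sum_k b_k * v_jk
a = [sum(b[k] * V[j][k] for k in range(2)) for j in range(2)]

# check on one scaled data point x: both forms give the same prediction
x = [1.5, -0.5]
t = [sum(x[j] * V[j][k] for j in range(2)) for k in range(2)]  # t_k = sum_j x_j v_jk
pred_pc = b0 + sum(b[k] * t[k] for k in range(2))
pred_orig = b0 + sum(a[j] * x[j] for j in range(2))
print(pred_pc, pred_orig)  # equal (5.5 for these numbers)
```

Note these implied coefficients are still in *scaled* units; per the homework tip above, you also have to undo the scaling before predicting for a new city.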

Eigenvalues and EigenvectorsΒΆ

Given $A$: Square matrix

  • If we can find vector $v$ and constant $\lambda$ such that $Av = \lambda v$,
  • $v$ is an eigenvector of $A$
  • $\lambda$ is an eigenvalue of $A$
    • $\det(A - \lambda I ) = 0$, meaning:
    • Every value of $\lambda$ where the determinant of $A$ minus $\lambda$ times the identity matrix equals zero is an eigenvalue of $A$.
  • Opposite direction: Given $\lambda$, solve $Av = \lambda v$ to find the corresponding eigenvector $v$
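A quick check on a small symmetric matrix: $[[2,1],[1,2]]$ has eigenvalues 3 and 1, and $(1,1)$ is the eigenvector for 3.

```python
A = [[2.0, 1.0],
     [1.0, 2.0]]

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

# characteristic polynomial of a 2x2: det(A - lam*I) = (a - lam)(d - lam) - b*c
def char_poly(A, lam):
    return (A[0][0] - lam) * (A[1][1] - lam) - A[0][1] * A[1][0]

# lam = 3 and lam = 1 are roots, so they are the eigenvalues
print(char_poly(A, 3.0), char_poly(A, 1.0))  # 0.0 0.0

# (1, 1) satisfies A v = 3 v, so it is the eigenvector for lam = 3
print(matvec(A, [1.0, 1.0]))  # [3.0, 3.0]
```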

PCA: Pros and ConsΒΆ

Points to consider:

  • Just because the first principal component has more variation doesn't mean it's always the most helpful for predictive modeling. This is because PCA depends only on the independent variables, not on the response.
  • This can (on rare occasions) lead to cases where the response is driven by variables with low variability rather than by those with high variability.

Example: Classification

  • Left (Good): PCA gives PCs D1 and D2. D1 is helpful for differentiating between blue/red points. D2 essentially gives dividing line between the two colors.
  • Right (Bad): D1 doesn't help separate between red and blue points. D2 is actually the more helpful separator.

img

Takeaway:

  • Usually, dimensions with more variability are better differentiators, because they carry more information. But this is not always true.

Week 7 - Advanced Regression (Trees)ΒΆ

Types:

  • CART (Classification and Regression Trees)
  • Decision Tree

Regression TreesΒΆ

Terminology:

  • branch: point where a splitting heuristic is applied.
  • node: a group of data points after the data is split.
  • internal nodes: non-terminal nodes.
  • leaf nodes: terminal nodes with no branches beyond them.

img

Method:

  1. Stratify feature space: Split data points by a certain predictor (e.g. age). This is greedy recursive binary splitting (refer to ISL Ch.8).
    • Typically use several predictors to determine split (e.g. age, number of kids, household income).
  2. Create a regression model at each split, using only the data points that belong in each node. This creates multiple regression models.
  3. Prediction: For each new data point pass the data through the decision tree and then use the leaf node's coefficient to return a response.

How regression trees are actually calculated

  • Problem: Running a full regression model in each node is expensive, and each additional depth has a higher likelihood of overfitting (because there is less data per node).
  • Solution: Just calculate the constant term, by taking the average response in the node:
$$ a_0 = \frac{\Sigma_{i \in \text{node}}\, y_i}{\text{number of data points in node}} = \text{avg response in node} $$

Derivation: img
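A toy sketch of one greedy split: try each candidate split point, keep the one with the lowest total squared error, and let each leaf predict its node's mean (made-up data):

```python
# toy 1-D dataset (hypothetical): the response jumps after x = 3
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

def sse(vals):
    """Sum of squared errors if we predict the mean of vals."""
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

# greedy: evaluate every candidate split "x <= s", keep the lowest total SSE
best_split, best_err = None, float("inf")
for s in xs[:-1]:
    left = [y for x, y in zip(xs, ys) if x <= s]
    right = [y for x, y in zip(xs, ys) if x > s]
    if sse(left) + sse(right) < best_err:
        best_split, best_err = s, sse(left) + sse(right)

# each leaf predicts the average response of its node
left_leaf = [y for x, y in zip(xs, ys) if x <= best_split]
right_leaf = [y for x, y in zip(xs, ys) if x > best_split]
print(best_split, sum(left_leaf) / len(left_leaf), sum(right_leaf) / len(right_leaf))
# 3 6.0 21.0
```

Real implementations repeat this recursively on each node and over all factors; this shows only a single split on one factor.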

Other uses of trees The model used at each node would be different:

  1. Logistic regression models: Use fraction of node's data points with "true" response.
  2. Classification models: Use most common classification among node's data points.
  3. Decision models: At each branch determines whether or not we meet a threshold to make a decision (e.g. send marketing email).

BranchingΒΆ

2 things to consider:

  1. Which factor(s) should be in the branching decision?
  2. How should we split the data?

Branching Method

(Note, this is one method - there are other ways to do this)

Key Ideas:

  • Use a metric related to the model's quality.
  • Find the "best factor" to branch with.
  • Verify that the branch actually improves the model. If not, prune the branch back.

How to reject a potential branch

  • Low improvement benefit (based on threshold)
  • One side of the branch has too few data points now (e.g. each leaf must contain at least 5% of original data).

"Overfitting our model can be costly; make sure the benefit of each branch is greater than its cost"

img

Random ForestsΒΆ

MotivationΒΆ

  • Introduce randomness
  • Generate many trees
  • Average of multiple trees results in better predictions.

Introducing randomnessΒΆ

We introduce randomness in 3 steps:

  • Bootstrapping: Sample n data points (with replacement) from the original data and build a tree.
  • Branching: Randomly choose a subset of factors, and choose the best factor from that subset at each branch.
    • A common number of factors to use is $1 + \log(n)$
  • No pruning.

Effect:

  • Each tree in the forest has slightly different data.
  • We end up with a lot of different trees (usually 500-1000)
  • Each tree gives us a different regression model.

Results:

  1. For regression, use the average predicted response (average).
  2. For classification, use the most common response (mode).

Pros/Cons:

  • Pro: Average between trees can neutralize over-fitting. This gives better predictions.
  • Con: Harder to explain/interpret. Doesn't provide direct weights for each predictor. "Black-box" predictor.

Explainability / InterpretabilityΒΆ

Why explainability matters:

  • Helps us understand the "why" of the model.
  • Helps stakeholders accept our ideas.
  • Sometimes a legal requirement.
  • However, less explainable models may give better results (can fit better to complex data).

Linear Regression ExampleΒΆ

img

Takeaway: Linear regression models are pretty explainable. Each coefficient is tied to a specific predictor.

Regression Tree ExampleΒΆ

img

Takeaway: Tree-based models are less explainable. Branching logic is often not intuitive. Random forest is even worse.

Logistic RegressionΒΆ

Definition: Turns regression model into a probability model.

Standard Linear Regression:

$$ y = a_0 + a_1x_1 + \ldots + a_jx_j $$

Logistic Regression: Takes linear function and puts it into an exponential.

  • $p$: probability of the event you want to observe
$$ \log{\frac{p}{1-p}} = a_0 + a_1x_1 + \ldots + a_jx_j $$$$ p = \frac{1}{1+e^{-(a_0 + a_1x_1 + \ldots + a_jx_j)}} $$

The linear function can return any number from $-\infty$ to $+\infty$; the logistic transform maps it to a probability:

  • As it approaches $-\infty$, $p \to 0$
  • As it approaches $+\infty$, $p \to 1$
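The squashing is easy to see numerically (minimal sketch):

```python
import math

def logistic(z):
    """Map the linear predictor z to a probability p in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(0))    # 0.5: the linear part contributes no evidence either way
print(logistic(-10))  # ~0: large negative z drives p toward 0
print(logistic(10))   # ~1: large positive z drives p toward 1
```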

Logistic Regression CurveΒΆ

Example 1:

img

  • Size of each dot shows how many observations there are.

Example 2:

img

  • If there are 0/1 responses for almost every predictor value, we can visualize it like above.
  • Each bar shows the fraction of responses that are one for each predictor value
  • Shows how LR curve fits the data.

Logistic Regression vs Linear RegressionΒΆ

Similarities

  • Transforms input data
  • Consider interaction terms
  • Variable selection
  • Single-tree / random forest versions of logistic regression exist.

Differences

  • Longer to compute
  • No closed-form solution
  • Understanding quality of model

Measures of model quality:

  • Linear regression used R-squared value (% of variance in the response that is explained by the model)
  • Logistic regression uses a "pseudo" R-squared value, because it's not really measuring % of variance.

Logistic Regression as ClassificationΒΆ

Thresholding:

  • Answer "yes" if $p > N$, otherwise answer no.
  • Example: if $p \ge 0.7$, give loan.

Receiver Operating Characteristic (ROC) CurveΒΆ

img

Area Under Curve (AUC) = Probability that the model estimates a random "yes" point higher than a random "no" point.

  • Higher means less random prediction (surrogate for quality).
  • Example: Loan payment
    • Joe: repaid loan, Moe: Didn't repay loan.
    • AUC explains probability that model gives Joe's data point a higher response value than Moe.
  • $AUC = 0.5$ means we are just guessing.
  • Also called the "concordance index".
  • Doesn't differentiate cost between false negatives and false positives.
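That pairwise definition translates directly into brute-force code (made-up scores; ties count half):

```python
# model scores for actual "yes" points (repaid) and "no" points (defaulted) -- made up
yes_scores = [0.9, 0.8, 0.6]
no_scores = [0.7, 0.4, 0.3]

# AUC = fraction of (yes, no) pairs where the "yes" point is scored higher
wins = 0.0
for y in yes_scores:
    for n in no_scores:
        if y > n:
            wins += 1.0
        elif y == n:
            wins += 0.5  # ties count as half
auc = wins / (len(yes_scores) * len(no_scores))
print(auc)  # 8 of 9 pairs ranked correctly -> ~0.889
```

Real libraries compute this from the ROC curve instead of enumerating pairs, but the value is the same.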

Confusion MatrixΒΆ

img

There are many metrics derived from confusion matrix, main ones to know are:

  • $Sensitivity = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
  • $Specificity = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}}$
  • $Precision = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
  • $Recall = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$ (same formula as sensitivity)
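From the raw counts these metrics are one-liners (hypothetical counts):

```python
tp, fn, tn, fp = 40, 10, 45, 5  # hypothetical confusion-matrix counts

sensitivity = tp / (tp + fn)  # a.k.a. recall: share of actual positives caught
specificity = tn / (tn + fp)  # share of actual negatives correctly rejected
precision = tp / (tp + fp)    # share of predicted positives that are right

print(sensitivity, specificity, precision)  # 0.8 0.9 ~0.889
```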

Evaluating a model's quality using Confusion MatrixΒΆ

Example: Spam detection, confusion matrix

img

Idea: We can assign costs for each factor and calculate the sum.

$$ \text{Total Cost} = TP \cdot Cost_{TP} + FN \cdot Cost_{FN} + FP \cdot Cost_{FP} + TN \cdot Cost_{TN} $$

Example: Cost of lost productivity

  • \$0 for correct classifications.
  • \$0.04 to read spam
  • \$1 to miss a real message
  • Total cost: \$14
$$ \$14 = 490 * \$0 + 10 * \$1 + 100 * \$0.04 + 400 * \$0 $$
  • If ratio of spam changes, add a scaling factor into the equation (new value divided by old value):

img
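The $14 total works out directly from the counts:

```python
# counts from the spam example (correct real, missed real, read spam, caught spam)
correct_real, missed_real, read_spam, caught_spam = 490, 10, 100, 400

cost = (correct_real * 0.0    # correct classifications are free
        + missed_real * 1.00  # $1 to miss a real message
        + read_spam * 0.04    # $0.04 to read a spam message
        + caught_spam * 0.0)
print(cost)  # 14.0
```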

Poisson regressionΒΆ

Used when we think the response follows a Poisson distribution.

$$ f(z) = \frac{\lambda^ze^{-\lambda}}{z!} $$
  • Example: count arrivals at an airport security line.
    • arrival rate might be a function of time.
    • We estimate $\lambda(x)$

Regression splinesΒΆ

  • Spline: Function of polynomials that connect to each other.
  • Knot: Point at which each function connects to another.
  • Fit different functions to different parts of the data set.

img

Other splines:

  • Order-k regression spline: polynomials are all order k
    • Example: multivariate adaptive regression splines (MARS)
    • Called "Earth" in many stat libraries.

Bayesian RegressionΒΆ

We start with:

  • Data, plus
  • An estimate of how the regression coefficients and the random error are distributed.

  • Example: Predict how tall a child will be as an adult based on:

    • Data: heights of the child's mother and father
    • Expert opinion: starting distribution.
      • Example: coefficients uniformly distributed between 0.8 and 1.2
    • Use Bayes' theorem to update estimate based on existing data.
    • Most helpful when there is not much data. Use small existing data to temper expert opinion.
    • If there is no expert opinion, choose a broad prior distribution (e.g. uniform over a large interval).

KNN RegressionΒΆ

  • No estimate of prediction function.
  • Plot all the data.
  • Predict response, using average response of K closest data points.
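A minimal 1-D sketch of KNN regression (made-up training data):

```python
def knn_predict(train, x_new, k):
    """Predict by averaging the responses of the k nearest training points."""
    # train: list of (x, y) pairs with 1-D predictors for simplicity
    nearest = sorted(train, key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

train = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (10.0, 30.0)]
print(knn_predict(train, 2.2, k=2))  # averages the responses at x=2 and x=3 -> 13.0
```

With multiple predictors, `abs(p[0] - x_new)` becomes a distance function (and scaling the factors matters, just like in KNN classification).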

Course Material for Midterm 1 EndΒΆ


Week 8 - Variable SelectionΒΆ

NOTE: These use regression, but can be applied to any model.

MotivationΒΆ

  1. Overfitting: When the # of factors >= the # of data points, the model begins fitting to random effects.
  2. Simplicity: Simpler models are better than complex ones.
    • Less expensive
    • Less chance of including insignificant factors (with many factors, even individually low p-values add up to a real chance that some factors are spurious).
    • Easier to explain
    • Some factors are illegal to use; factors highly correlated with forbidden ones can't be used either. It's also hard to demonstrate that a complex model is innocuous.

Variable Selection ApproachesΒΆ

Two Schools: Classic and OptimizationΒΆ

ClassicalΒΆ

What:

  • Decision made step by step.
  • Greedy Algorithm: At each step takes one thing that looks best without considering future options.

Types:

  1. Forward Selection
  2. Backward Elimination
  3. Stepwise Regression
OptimizationΒΆ

What: Applies optimization for variable selection.

Types:

  • LASSO
  • Elastic Net

(Classical) Forward SelectionΒΆ

Definition: Keep adding "good enough" factors until there's no more to add. Then remove those with high p-value.

img

  • Definitions of "good enough" and "high p-value" are subjective; typically 0.10-0.15 for "good enough", p-value > 0.05 for "high".

(Classical) Backward EliminationΒΆ

Definition: Keep removing "bad" factors until there's no more to remove.

img

(Classical) Stepwise Regression (Forward + Reverse)ΒΆ

Definition: Combine forward selection / backward elimination.

img

(Optimization) LASSOΒΆ

What: Adds constraint to standard regression equation (or other algos), then optimizes on it.

Math:

  1. Minimize the standard regression equation: $$ \min \Sigma^m_{i=1} (y_i - (a_0 + a_1x_{i1} + \ldots + a_nx_{in}))^2 $$

  2. Add a "budget" $\tau$ to the minimization of the squared sum of errors. This "constrains" the sum of the coefficients' absolute values from becoming too large.

    • $a_j$: Regression coefficients
    • $\tau$: Constraint budget $$ \text{Subject To } \Sigma^n_{j=1} |a_j| \le \tau $$

Tips:

  • Make sure to scale your data (like SVM)
  • How to choose $\tau$ - Factors to consider:
    • Number of variables
    • Quality of model
    • Choose the $\tau$ that gives the best tradeoff of the two.

(Optimization) Elastic NetΒΆ

What: Constrain combination of absolute value of coefficients and their squares.

  • Similar approach to choosing $\tau$, but also for $\lambda$.

Math (Minimization same as LASSO):

Add a constraint on the coefficients and their squares:

$$ \text{Subject To } \lambda\Sigma^n_{j=1} |a_j| + (1-\lambda)\Sigma^n_{j=1}a_j^2 \le \tau $$

(Aside) Ridge RegressionΒΆ

  • Same as Elastic Net, but removes the absolute value term in the constraint.
  • NOT used for variable selection!
  • Used for prediction.

Math (Minimization same as LASSO):

$$ \text{Subject To } \Sigma^n_{j=1} a_j^2 \le \tau $$

Lasso Regression (LR) vs Ridge Regression (RR)ΒΆ

  • LR/RR mathematically similar, but LR does variable selection and RR does not. Why?

img

  • When plotted in the 2D coefficient space, LR's constraint region is a diamond; anything within the shaded (yellow) region satisfies the constraint. The same plot for RR shows a circle.

img

  • The difference is how those constraints interact with the same error function.
  • The error function (rearranged) is quadratic, so its level sets appear visually as ellipses. Depending on the choice of coefficients, the ellipse grows or shrinks.

img

  • Every point on a given ellipse has the same total error.
  • We are trying to find the smallest value of error function, while meeting the optimization constraint.
  • This is shown visually as where the ellipse and shaded area intersect.
  • As total error gets larger, ellipse gets closer to shaded area.
  • The point at which they touch is the smallest error we can find AND satisfies LR constraint.
  • If the error ellipse touches the constraint region only at a corner (e.g. on the $a_1$ axis), then $a_2$'s coefficient is zero, hence we can remove that variable.
    • The ellipse can instead touch at a point where both coefficients are nonzero; then we keep both.
    • The more variables we have and the more restrictive $\tau$ is, the more variables we're likely to see removed.

img

  • Ridge regression's constraint region is a circle, so the error ellipse will most likely touch it where both $a_1$ and $a_2$ are non-zero.
  • Hence, we rarely see factors get a coefficient of zero (ie variable removal). img

Some more notes from ISLRΒΆ

  • Lasso uses L1 regularization, while ridge regression uses L2.
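One way to see the selection difference in code: in the penalized (Lagrangian) view of the same problem, the L1 penalty acts on each coefficient like a soft-threshold that snaps small values to exactly zero, while the L2 penalty only rescales them. A one-coefficient sketch, not a full solver:

```python
def lasso_shrink(a, lam):
    """L1 soft-threshold: coefficients smaller than lam become exactly zero."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def ridge_shrink(a, lam):
    """L2 shrinkage: coefficients get smaller but stay nonzero."""
    return a / (1.0 + lam)

print(lasso_shrink(0.3, 0.5))  # 0.0  -> variable effectively removed
print(ridge_shrink(0.3, 0.5))  # ~0.2 -> shrunk, but still in the model
```

This is the algebraic counterpart of the corner-of-the-diamond picture above.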

The "Bias-Variance Tradeoff"ΒΆ

Bias:

  • High bias: Less fit to real patterns (bad)
  • Low variance: Less fit to random patterns (good)
  • Extreme case of high bias is just using the intercept: $$ y = a_0 $$

Variance:

  • Low bias: More fit to real patterns (good)
  • High variance: More fit to random patterns (bad)
  • Extreme case of high variance is using all the coefficients with no bias term.

The goal is to find the "sweet spot" where total error is at its minimum, and underfit/overfit errors are both small.

img

Ridge RegressionΒΆ

Motivation: RR reduces overfitting (variance) by scaling down the magnitude of each coefficient.

  • This is a type of regularization.
  • A coefficient ($a_1$) shows the magnitude of its factor's ($x_1$) effect.

Application: Use when model is performing worse on validation data than training data.

  • We can't selectively reduce the fit to random patterns only, so we reduce the fit to all patterns.
  • By reducing coefficient magnitudes, the differences between the model's responses decrease, thus reducing overfitting.
  • Just be careful not to underfit.

Choosing a Variable Selection ModelΒΆ

Motivation: How to decide which VS model to choose?

Classic options (forward/backward/stepwise):

  • Good for initial analysis.
  • May tend to overfit, hence performance may suffer.

Optimization options (LASSO, Elastic net)

  • Slower but better prediction.
  • Elastic net Pros/Cons:
    • Pros: Has VS benefits of LASSO plus predictive benefits of RR.
    • Cons:
      • Arbitrarily removes multiple correlated predictors. It may end up keeping the more expensive predictor.
      • The RR part may reduce the magnitude of very relevant predictor's coefficient.
  • Best practice: No good rule of thumb, compare the difference between each! Human input always beneficial.

Week 9.1 Design of ExperimentsΒΆ

Design of Experiments (DOE)ΒΆ

Motivation:

  • We can't always get data easily.
  • How to get representative sample (many dimensions, not so many data points)

Comparison and ControlΒΆ

Definition: When designing experiments, we need to control extraneous factors.

BlockingΒΆ

Definition: Controlling for factors that create variation by grouping similar data points into "blocks".

Example: Red car vs Blue cars

  • Sports car vs family car can create variation within these subsets.
  • Reduces variance.

A/B TestingΒΆ

Definition: Create control and treatment group(s), test which of two alternatives is better.

Three things that needs to be true:

  1. Collect data quickly
  2. Data must be representative
  3. Amount of data is small compared to whole population

Factorial DesignΒΆ

Motivation: Understand importance of all factors between A/B test.

Definition: Test the effectiveness of every combination.

  • Example: 2 fonts * 2 wordings * 2 backgrounds = 8 combinations.
  • Run ANOVA (analysis of variance) to determine importance of each factor.
  • What if there are many combinations?
    • Ex: 7 factors, 3 choices each = $3^7$ = 2187 combos.
  • Solution: Fractional factorial design - test only a subset of the combinations.
    • Need to use a balanced set of choices.

img

Independent Factors: Test subset of combos, then use regression to estimate effects.

  • Each factor gets a categorical variable (background color [gold, blue, white], font size [10, 12, 14]).
  • Use regression to estimate response (e.g. number of clicks).
  • Assumption is that each factor is independent.

Multi-armed BanditsΒΆ

"Exploration vs Exploitation"

  • Need to balance benefit of more information (explore) vs getting immediate value (exploit).
  • Ex: 10 alternatives, 10K tests (1K on each alternative)
    • If the best alternative could be identified after the first 1K tests, the remaining 9K tests on inferior alternatives net negative value.

Definition:

  • Run multiple tests and assign more tests to those likely to succeed.
  • Benefit: Explore on the fly but also create more value along the way.

Method:

  1. Decide K alternatives
  2. Start with NO information (equal probability of selecting each alternative)
  3. After N number of tests, update probabilities of each one being the best alternative.
  4. Start assigning new tests according to those probabilities.
    • By giving more weight on the likely best alternatives, we are still exploring but making sure we are exploiting as well.
  5. Repeat until sufficiently sure of the best alternative.

Parameters:

  • Number of tests between recalculating probabilities.
  • How to update the probabilities (Bayesian, estimate on observed distributions)
  • How to pick an alternative to test based on probabilities.
    • e.g. look at the distribution of each alternative being best, or take into account the expected value of each.
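One common way to implement the update-and-reassign loop is Thompson sampling: keep a Beta posterior for each alternative, sample from each posterior, and test the arm with the best sample. A sketch with made-up success rates:

```python
import random

random.seed(42)
true_p = [0.05, 0.11, 0.08]  # hypothetical true success rates of 3 alternatives
wins = [0, 0, 0]
losses = [0, 0, 0]
plays = [0, 0, 0]

for _ in range(3000):
    # sample each arm's success rate from its Beta posterior, play the best sample
    samples = [random.betavariate(1 + wins[i], 1 + losses[i]) for i in range(3)]
    arm = samples.index(max(samples))
    plays[arm] += 1
    if random.random() < true_p[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

print(plays)  # test assignments drift toward the better alternative over time
```

Early on every arm gets explored; as evidence accumulates, the posterior for the best arm concentrates and absorbs most of the remaining tests.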

Week 9.2 Probabilistic Models pt.1ΒΆ

Summary:

  • Sometimes a simple model is all you need. Collecting data for lots of factors can be expensive with little ROI.
  • Probability distributions can form the backbone of such simple models.

Bernoulli DistributionΒΆ

Definition:

  • Coin flip (not always 50/50).
  • Useful when we do multiple "coin flips" across several factors.

Probability mass function (PMF)*

$$ P(X=1) = p \\ P(X=0) = 1-p $$

(PMF represents the probability associated with each possible value of the random variable)

Example: Charity donations

  • Charity asks for donation from 1/12 of their mailing list each month.
  • For each person:
    • $P(\text{sends donation}) = p$
    • $P(\text{does not send donation}) = 1-p$
  • Assumptions:
    • Donations are same size each month.
    • Each donor is independent (ie. bernoulli trial for each potential donor).
    • $p$ does not change month to month. Otherwise, the charity won't have consistent cash flow.
  • The number of donations each month is then binomially distributed.
    • Test: Observe how many successful trials each month, then use the binomial distribution to find the probability of getting at least that observed number of donations.
    • If probability is very low, it is likely the $p$ is different for that month.

Binomial DistributionΒΆ

Definition: Probability of getting $x$ successes out of $n$ independent identically distributed Bernoulli ($p$) trials.

  • Responses must be i.i.d.!
  • ie. $p$ cannot differ between each time period.

Probability Mass Function: $$ P(X=x) = \binom{n}{x}p^x(1-p)^{n-x} = \frac{n!}{(n-x)!\,x!}p^x(1-p)^{n-x} $$

Graph:

  • With greater $n$ the binomial distribution converges to normal distribution. img
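The PMF in code (`math.comb` gives the binomial coefficient); the probabilities over all possible outcomes sum to 1:

```python
import math

def binom_pmf(x, n, p):
    """P(X = x): probability of x successes in n Bernoulli(p) trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# example: 3 donations out of 10 solicitations when p = 0.3
print(round(binom_pmf(3, 10, 0.3), 4))  # 0.2668

# sanity check: probabilities over all outcomes 0..n sum to 1
print(round(sum(binom_pmf(x, 10, 0.3) for x in range(11)), 10))  # 1.0
```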

Geometric DistributionΒΆ

Definition: Probability of having $x$ Bernoulli($p$) failures until first success.

  • (Or in reverse - Having $x$ Bernoulli(1-p) successes until first failure)
  • Think carefully before deciding $p$ and its opposite $1-p$!

Examples:

  • How many hits until a baseball bat breaks
  • How many good units manufactured before defective one.

Probability Mass Function

$$ P(X=x) = (1-p)^xp $$

Graph: img

Case Study: Assuming each Bernoulli trial is i.i.d., can we compare data to the geometric distribution to test whether the i.i.d. assumption is true?

  • Example 1: How many hits before a baseball bat breaks.
    • If it fits the geometric distribution, the hits are i.i.d. Bernoulli trials.
    • If it doesn't fit, hits are not i.i.d. and need to consider other factors.
  • Example 2: Are security personnel more likely to pull someone aside if there hasn't been a search in a while?
    • Question is whether screeners are screening each passenger based on independent factors, or if they are selecting people based on whether or not someone was screened recently.
    • Left chart follows geometric distribution, ie. each person is screened independently.
    • Right chart does not, there are other factors at play (like the last time someone was screened).
  • Example 3: Successful sales calls, where we want the probability of having 5 successful sales calls before the first unsuccessful call:
    • Probability: $p^5(1-p)$

img
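The PMF and Example 3 in code (`p_success` is made up):

```python
def geom_pmf(x, p):
    """P(X = x): probability of x failures before the first success."""
    return (1 - p) ** x * p

# Example 3 flipped around: 5 successful calls before the first unsuccessful one.
# Treat "successful call" as the repeated event and "unsuccessful" as the stopper,
# i.e. the stopping event has probability 1 - p_success.
p_success = 0.6  # hypothetical
print(geom_pmf(5, 1 - p_success))  # = p_success**5 * (1 - p_success)
```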

Poisson / Exponential / WeibullΒΆ

Poisson DistributionΒΆ

  • Good at modeling random arrivals
  • $\lambda$: average number of arrivals/time period
  • arrivals are i.i.d.

Probability Mass Function: $$ f_x(x)=\frac{\lambda^x e^{-\lambda}}{x!} $$

Chart: img
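The PMF in code; with $\lambda = 4$ the most likely counts are near 4, and the probabilities sum to ~1:

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) arrivals in a period with average arrival rate lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

print(round(poisson_pmf(4, 4.0), 4))  # 0.1954: the mode sits at the rate
print(round(sum(poisson_pmf(x, 4.0) for x in range(50)), 6))  # ~1.0
```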

Exponential DistributionΒΆ

Relation to Poisson:

  • If arrivals are Poisson($\lambda$), then the time between successive arrivals follows an exponential($\lambda$) distribution.

Probability Density Function $$ f_x(x) = \lambda e^{-\lambda x} $$

Chart: img

Case study of relation between Poisson and exponential (security line delays at airport):

  • The experiment collected inter-arrival times.
  • The data showed inter-arrival times were exponentially distributed,
  • which means the number of arrivals per unit time is Poisson.
  • Note: This can work vice-versa as well!

Weibull DistributionΒΆ

Definition: The length of time it takes before a failure. (Compare: geometric models the number of tries between failures.)
  • Lightbulb example:
    • Geometric: How many lightswitch flips on/off before bulb fail?
    • Weibull: Leave the bulb on; how long until bulb fails?

Probability Density Function:

  • $\lambda$: scale param
  • $k$: shape param $$ f_x(x) = \frac{k}{\lambda}(\frac{x}{\lambda})^{k-1} e^{-(\frac{x}{\lambda})^k} $$

Intuition of $k$:

  • $k < 1$: Models failure rate decreasing with time (e.g. defects)
  • $k > 1$: Models failure rate increasing with time (e.g. tires)
  • $k = 1$: Models failure rate constant with time.

Relation between Weibull / Poisson:

  • When $k=1$, the Weibull is the exponential distribution.
  • Just replace $\lambda$ with $\frac{1}{\lambda}$ to get the exponential distribution's rate.
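A quick numeric check that Weibull with $k=1$ and scale $\lambda$ matches the exponential with rate $1/\lambda$:

```python
import math

def weibull_pdf(x, k, lam):
    """Weibull density with shape k and scale lam."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def exponential_pdf(x, rate):
    return rate * math.exp(-rate * x)

x, lam = 2.0, 5.0
print(weibull_pdf(x, 1.0, lam), exponential_pdf(x, 1.0 / lam))  # equal
```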

Using software to fit dataΒΆ

  • Input: data
  • Output: Fit to varying distributions / params
  • Only use software output as guidance
    • Example: software return Weibull, k=1.002
    • This is basically exponential (since Weibull k=1 is same as exponential)
    • It may just return Weibull bc data has some noise in it.

Q-Q PlotsΒΆ

Motivation: Even if 2 sets of data have different scale of data / number of data points, 2 similar distributions will have about the same value at each quantile.

  • Example: If 2 sets of students share similar distribution, the height at each percentile should be about the same.

Use:

  • Understand whether 2 sets of data share similar distribution
  • Understand whether 1 set of data fits 1 specified distribution.

Note: Statistical tests can do the same, but sometimes visual representation is easier to understand.
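The computation behind a Q-Q plot is just "quantiles of one dataset against quantiles of the other." A sketch with two made-up samples of different sizes:

```python
import random
import statistics

random.seed(1)
# two made-up samples of different sizes drawn from the same distribution
data1 = [random.gauss(170, 10) for _ in range(200)]
data2 = [random.gauss(170, 10) for _ in range(500)]

# matching percentiles (here deciles) should be roughly equal if the
# distributions match; plotting these pairs gives the Q-Q plot
q1 = statistics.quantiles(data1, n=10)
q2 = statistics.quantiles(data2, n=10)
pairs = list(zip(q1, q2))
print(pairs[0], pairs[-1])  # lower-tail and upper-tail decile pairs
```

If the pairs fall along the 45-degree line, the distributions match; systematic bends at the ends indicate tail differences like those described below.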

Good fit vs Bad fit: img

  • Good match (left): the quantiles of both datasets fall along the 45-degree line.
  • Bad match (right): Dataset 2 has higher values in the lower percentiles (lower tail) as well as the higher percentiles (higher tail) compared to dataset 1.

Misleading distribution caught by QQplot: img

  • Statistical tests may erroneously say these are a good match.
  • Visually, we can see that dataset 2 has higher values in the lower tail and lower values in the upper tail.
  • If we care about the extreme ends, the 2 datasets are not a good match.

Testing single datasetΒΆ

  • X-axis: Data
  • Y-axis: Theoretical values of percentiles
  • Note: This is not always the case; some software plots the axes the other way around.
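The two-sample quantile comparison can be sketched numerically without plotting (assuming numpy/scipy; the height data is made up): matched quantiles falling on a 45-degree line means correlation near 1 and slope near 1.

```python
# Compare two samples quantile-by-quantile, the numeric core of a Q-Q plot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights_a = rng.normal(170, 10, size=300)   # hypothetical student heights
heights_b = rng.normal(170, 10, size=800)   # different sample size is fine

q = np.linspace(0.01, 0.99, 99)
qa = np.quantile(heights_a, q)              # empirical quantiles, dataset 1
qb = np.quantile(heights_b, q)              # empirical quantiles, dataset 2

# Points near the 45-degree line (correlation ~1, slope ~1) = good match
slope, intercept, r, _, _ = stats.linregress(qa, qb)
print(round(slope, 2), round(r, 3))
```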

Week 9.2 - Probabilistic Models pt.2ΒΆ

Case Study: QueuingΒΆ

img img img

A more complex example: Adding 2 employees to the call center and only hold 4 customers in the queue.

  • Can still give a closed-form answer due to memoryless property

Memoryless PropertyΒΆ

Definition: Doesn't matter what has happened in the past, what matters is where we are now.

Memoryless Exponential distribution: Distribution of remaining time = initial distribution of time.

  • In exponential distribution, it doesn't matter how long an employee has been on the phone. The distribution of the remaining call time is the same.
    • Note: This is purely a property of the exponential distribution (not something we need to know beforehand about phone calls).
    • If phone call time fits this distribution, it is memoryless.

Memoryless Poisson distribution: Distribution of time to next arrival = initial distribution of time to next arrival.

  • Poisson is related to exponential distribution.
  • Poisson interarrival times are exponentially distributed.

Works the other way too:

  • If data fits exponential distribution -> memoryless
  • If we know data is not memoryless -> not exponential

Law firm example: Is it memoryless?ΒΆ

  • Should tire manufacturer pay damages for an accident that happened when tire with 10K miles failed?
  • First need to know if tires generally fail at 10K miles, to understand whether this tire was defective or whether this is normal-enough tire behavior.
  • Can answer by modeling tire failure:
    • P(tire fail at 10K miles) = ?
    • Assumption: Tires more likely to fail, the more worn out they are.
    • Thus, not memoryless
    • Thus, can't model with exponential distribution
    • Instead we could try Weibull with $k>1$ (which models failure over time)

Queuing example: memoryless?ΒΆ

Potential model params:

  • General arrival distribution $A$
  • General service distribution $S$
  • Number of servers $c$
  • Size of queue $K$
  • Population size $N$
  • Queueing discipline $D$
  • Kendall notation $A/S/c/K/N/D$: Standard notation to describe a queueing system.

Other factors that Kendall notation doesn't include:

  • Customers hanging up the phone in frustration
  • Customers "balking" at the queue

Note that Simulation can be used instead for cases where the mathematical model gets too complex (too many factors).

SimulationΒΆ

Definition: Build a model of something to study and analyze its behavior.

Examples:

  • Manufacturing process
  • Airport lines
  • Freight train dispatching

Types of SimulationsΒΆ

Deterministic: No randomness; same inputs give same outputs.

Stochastic: Use when system has randomness. Focus of this lesson.

Continuous-time: Changes happen continuously.

  • Example: Chemical processes, propagation of diseases
  • Often modeled with differential equations

Discrete-event: Changes happen at discrete time points.

  • Example: Call center simulations (someone calls, worker finishes talking to customer)

Discrete event stochastic simulationsΒΆ

  • Valuable when systems have high variability
  • Using average values isn't good enough.

Simulation SoftwareΒΆ

Components:

  • Entities: Things that move through the simulation (e.g. bags, people)
  • Modules: Parts of the process (e.g. queues, storage)
  • Actions
  • Resources (e.g. workers)
  • Decision points
  • Statistical tracking

Under the hood: Uses (pseudo) random number generation, tracking, animation, etc

Replications: number of runs of simulation

  • 1 replication = one data point (unrepresentative)
  • Run multiple times to get distribution of outcomes
  • Example: Simulating average daily throughput over # of replications (x=throughput, y=replications)

img

Validation: Use real data to validate your simulation is giving reasonable results.

  • Make sure to check both average AND variability.
  • Real & simulated avg don't match -> problem
  • Real & simulated variance don't match -> problem
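The replication idea can be sketched for a hypothetical single-worker call center (all rates made up): each replication is one data point, and validation means checking both the average and the spread against what we expect.

```python
# Stochastic discrete-event simulation of a single-server queue, replicated.
# Waiting times follow the Lindley recursion: W(n+1) = max(0, W(n) + S(n) - A(n+1)).
import numpy as np

lam, mu = 4.0, 5.0          # arrival & service rates (hypothetical)
n_customers = 20000
n_replications = 30

rng = np.random.default_rng(7)
avg_waits = []
for _ in range(n_replications):
    inter = rng.exponential(1 / lam, n_customers)    # inter-arrival times
    service = rng.exponential(1 / mu, n_customers)   # service times
    w, total = 0.0, 0.0
    for a, s in zip(inter[1:], service[:-1]):
        w = max(0.0, w + s - a)                      # wait of next customer
        total += w
    avg_waits.append(total / (n_customers - 1))

# One replication = one data point; report the distribution across replications
print(round(np.mean(avg_waits), 2), round(np.std(avg_waits), 2))
# Theory check: M/M/1 avg queue wait = lam / (mu * (mu - lam)) = 0.8
```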

Simulation for Prescriptive AnalyticsΒΆ

When to use simulation: Asking "what-if" questions

  • Ex: What is change in throughput with $100K investment in faster machines?
  • Ex: Value of hiring an extra call center worker?
  • Ex: Best distribution of baggage tugs to have at the airport (one for each/every-other gate, or floating pool)?
  • Compare simulated options to determine best course of action.
  • Simulation software may use built-in heuristic optimization to automatically optimize (e.g. best buffer size at each step of manufacturing process)

Simulation ComparisonsΒΆ

Steps:

  1. Design different simulations
  2. Run simulations
  3. Compare metrics between simulations. Important to decide what metrics to use here.
    • Ex: Baggage tugs - Success metric is fraction of bags that get to baggage claim in 20min or less.

Considerations:

  • One set of "random" observations could be better than another.
    • Solution: Use the same random number seeds across the compared simulations, replication by replication (replication 1 uses seed 1, replication 2 uses seed 2, ...).
  • Model is only as good as quality of input, missing/incorrect data leads to incorrect answer.
    • Ex: Call center simulation - Wrong assumption that all workers answer calls equally quickly. Leads to costly decisions.

How to Validate SimulationsΒΆ

Problem: Simulations can be validated by comparing them to data that already exists. However, experiments often simulate something that doesn't exist in reality. How to validate?

Example: Simulating spread of 2 interacting diseasesΒΆ

Problem 1: Matching the prevalence of the two diseases at the end of the simulation period was not enough to validate the model, because:

  • Multiple models matched the outcome after N simulation days, but
  • These models had very different results after N+1 simulation days.
  • If we can't tell which model is right, we can't tell which is a good public health policy.

Solution 1: Use multiple measures of disease spread (not just one)

  • Still use overall prevalence, but add others (distribution of disease cluster sizes, largest cluster size, etc)
  • A good model must match all metrics to show validation.

Problem 2: None of the metrics matched, even though the medical parameters followed the literature.

  • Why? The initial simulation tracked disease prevalence across the entire population, but data in the literature was typically collected starting from volunteers who had one of the diseases, followed by contact tracing.
  • The literature likely overestimated the metrics because of this data collection method; there are likely more isolated cases in reality.

Solution 2: Run 2 simulations

  1. Simulate disease propagation
  2. Given simulated disease propagation, simulate data collection.

Markov Chain ModelΒΆ

Definition: Models transition probabilities between states of a memoryless system.

How it worksΒΆ

For each state $i$ in the model,

  • $p_{ij}$: Transition probability from state $i$ to state $j$
  • $P=\{p_{ij}\}$: Transition matrix (e.g. weather states)

img

  • Given the transition matrix, we can calculate the probability of moving through multiple states.
    • Ex: Given $\pi=(.5,.25,.25)$ as probabilities of sunny, cloudy, rainy today, we get $\pi P=(.525,.25,.225)$ for sunny, cloudy, rainy tomorrow. Can keep multiplying for subsequent days ($\pi P^x$).

Steady stateΒΆ

Definition: Long-term probability that the system will be in each state.

  • Apply $P$ and get initial vector back: $$ \pi^*P=\pi^* $$
  • Solve for such a $\pi^*$ vector $$ \pi^*P = \pi^* \ \text{and} \ \Sigma_i\pi_i^*=1 $$
  • Not all Markov chains have a steady state.
    • Can't have cyclical behavior
    • Every state must be reachable from all others.
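Solving the two steady-state conditions can be sketched with numpy (the sunny/cloudy/rainy transition matrix below is a hypothetical example, not the one from the slides):

```python
# Solve pi* P = pi* together with sum(pi*) = 1 for the steady-state vector.
import numpy as np

P = np.array([[0.6, 0.3, 0.1],    # hypothetical sunny/cloudy/rainy chain
              [0.4, 0.4, 0.2],
              [0.3, 0.4, 0.3]])

# Rewrite pi* P = pi* as (P^T - I) pi = 0, then append the sum-to-1 equation
# and solve the (overdetermined but consistent) system by least squares.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi_star, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(pi_star, 3))
print(np.allclose(pi_star @ P, pi_star))  # → True: applying P returns pi*
```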

Key AssumptionΒΆ

Markov chains are memoryless. Biggest limitation.

  • Transitions only depend on the most recent state.
  • Most systems do not exhibit this property.
  • More useful when connecting small amounts of information to find larger ones. Example: Web search: States = web pages
    • For web pages $i$, $j$: $$ p_{ij} = p \text{ if page } i \text{ links to page } j \text{, 0 if not} $$
    • Markov chain = jumping randomly from page to page
    • Use the steady-state probabilities to rank web pages (Google Page Rank).
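A toy sketch of the PageRank idea: build the random surfer's transition matrix from a link graph, then power-iterate to the steady state. The 4-page graph and the 0.85 damping factor (an occasional random jump, which guarantees every state is reachable) are illustrative assumptions, not part of the lecture.

```python
# Rank pages by the steady-state probabilities of a random-surfer Markov chain.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}  # page -> pages it links to
n, d = 4, 0.85                                  # damping factor (assumed)

P = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        P[i, j] = 1 / len(outs)        # p_ij = p if page i links to page j
P = d * P + (1 - d) / n                # random jump keeps the chain ergodic

rank = np.full(n, 1 / n)
for _ in range(100):                   # power iteration toward steady state
    rank = rank @ P

print(np.round(rank, 3))               # steady-state probabilities = page ranks
```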

Other UsesΒΆ

Can be used to model things that are not memoryless in the short run, if the short-run memory doesn't matter for the long-run system (ex. urban sprawl, population dynamics, disease propagation)

Week 10.1 - Missing DataΒΆ

Some values are more likely to be missing than others; this leads to bias in the missing data.

Examples:

  • Date of death amongst cancer patients: patients who are still alive will have the field missing. Naive modeling may give an overly pessimistic prediction of life span.
  • Reported income: higher earners typically share less data. A naive model may predict low/medium earners well, but not those with high income.

Dealing with Missing DataΒΆ

Throw away data pointsΒΆ

  • Pro: Easy to implement. Doesn't fabricate values, so no imputation error.
  • Con: Lose data. If the missing data is censored or biased, throwing points away introduces bias.

Use categorical variables to indicate missing data.ΒΆ

With categorical data, we can simply add an extra "MISSING" category to the attribute.

Trickier with continuous data, some options:

  • If you fill missing data with 0, it might pull coefficients one way or another to account for the missing data.
  • Solution: Interaction terms
    • Creates two variables: one used when the data is not missing, and one when it is missing.
    • If you create interaction terms against all variables, you're essentially creating two separate models (like a tree split).

img
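The indicator-plus-interaction idea can be sketched with numpy (the feature values are made up):

```python
# Encode a continuous feature with missing values as (filled value, indicator)
# so a model can learn a separate effect for missing rows instead of
# treating the fill value of 0 as a real measurement.
import numpy as np

x = np.array([2.0, np.nan, 5.0, np.nan, 3.0])   # feature with missing values

is_missing = np.isnan(x).astype(float)          # 1 when missing, 0 when not
x_filled = np.where(np.isnan(x), 0.0, x)        # fill with 0...

# ...paired with the indicator in the design matrix:
#   y ~ b0 + b1 * x_filled + b2 * is_missing
X = np.column_stack([np.ones_like(x), x_filled, is_missing])
print(X)
```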

ImputationΒΆ

Definition: Estimate missing values

Midrange Value: Choose mean, median (numeric) or mode (categorical)

  • Pros: Hedge against being too wrong. Easy to compute.
  • Cons: Biased imputation (ex. income level: higher earners are less likely to report, so the computed average falls below the true average).

Regression: Reduce or eliminate the problem of bias.

  • Pros: Usually gives better value than filling average.
  • Cons: Complex - build, fit, validate, test to estimate missing value. Does not capture all the variability.

Perturbation: Add perturbation to each imputed value.

  • Ex. Normally-distributed variation.
  • Less accurate on average, but more accurate variability.
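A sketch of mean imputation with and without perturbation (numpy; the income values are made up):

```python
# Mean imputation vs. perturbed imputation: perturbation trades a little
# per-value accuracy for better-preserved variability.
import numpy as np

rng = np.random.default_rng(1)
income = np.array([40., 55., np.nan, 48., np.nan, 62., 51., np.nan])

observed = income[~np.isnan(income)]
mu, sigma = observed.mean(), observed.std()

mean_imputed = np.where(np.isnan(income), mu, income)

# Perturbation: add normally-distributed noise to each imputed value
noise = rng.normal(0, sigma, size=income.shape)
perturbed = np.where(np.isnan(income), mu + noise, income)

# Filling with the exact mean shrinks the spread of the dataset
print(round(mean_imputed.std(), 2), round(perturbed.std(), 2))
```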

Imputation approaches:

  • Limit the amount of imputation (no more than 5% per factor)
  • Con: Imputation introduces additional errors (imputation error + perturbation error + model error).
    • Don't worry too much: regular data also probably has errors.

Week 10.2 OptimizationΒΆ

  • Optimization is closer to prescriptive analytics, sitting above descriptive and predictive analytics. Can guide leadership's decision making.
  • Building optimization models is hard - software automates the solution but the model build is up to you.

Elements of Optimization ModelsΒΆ

Components:

  1. Variables
  2. Constraints
  3. Objective function

1) VariablesΒΆ

Definition: Decisions that the optimization model will find the best values for.

Example: Political candidate scheduling

  • $x_i$: Total time spent in state $i$
  • $y_i$: Number of visits to state $i$
  • $z_i$: 1 if state $i$ is ever visited, 0 if not.
  • $w_{id}$: time spent in state $i$ on day $d$
  • $v_{id}$: 1 if state $i$ visited on day $d$, 0 if not.

2) ConstraintsΒΆ

Definition: Restrictions on the decisions that can be made.

  • $\Sigma_i x_i \leq 30$: Only 30 days allotted for campaign.
  • $\Sigma_{d=24}^{30} v_{\text{Florida},d} \geq 3$: Must visit Florida at least 3 times between days 24-30.
  • $\Sigma_d v_{id} = y_i$: Total visits must add up correctly
  • $\Sigma_i (\alpha p_i \sqrt{x_i + \frac{1}{3}\Sigma_{j \in N(i)} x_j} + \beta v_{id}f_{d})$: Example of a more complex expression, whose parameters ($\alpha$, $\beta$, $p_i$, $f_d$) must be estimated by other models.

Objective functionΒΆ

Definition: A measure of the quality of a solution, i.e. a function of the variables that we're trying to maximize or minimize.

  • Example: Maximize expected new votes based on campaign schedule.
  • The components of the function needs to be determined by other analytics models (regression, etc). Oftentimes optimization models require outputs from other models as input.

Solution - Output of Optimization ModelΒΆ

Feasible Solution: Variable values that satisfy all constraints.

Optimal solution: Feasible solution with the best objective value.

Examples of Simple OptimizationΒΆ

Example 1 - Diet Problem (US Army)ΒΆ

Problem: Satisfy soldiers' nutritional requirements at minimum cost.

Definitions:

  • $n$: number of foods
  • $m$: number of nutrients
  • $a_{ij}$: amount of nutrient $j$ per unit of food $i$
  • $m_j$: minimum daily intake of nutrient $j$
  • $M_j$: maximum daily intake of nutrient $j$
  • $c_i$: per-unit cost of food $i$
  • $a_{ij}x_i$: Amount of nutrient $j$ obtained from food $i$

Optimization model components:

  1. Variables
    • $x_i$: amount of food $i$ in daily diet.
  2. Constraints
    • $\Sigma_i a_{ij}x_i \geq m_j$ for each nutrient $j$
    • $\Sigma_i a_{ij}x_i \leq M_j$ for each nutrient $j$
    • $x_i \geq 0$ for each food $i$
  3. Objective function
    • $\text{Minimize} \Sigma_i c_i x_i$
  4. Other complexities
    • Variety in diets
    • Seasonal cost variation
    • Taste of food
    • Taste of combinations of foods.
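The diet model above maps directly onto a linear program. A tiny sketch with scipy's `linprog` (2 foods, 2 nutrients; all costs, nutrient contents, and intake limits are made-up numbers):

```python
# Diet problem as an LP: minimize cost c@x subject to m <= A@x <= M, x >= 0.
import numpy as np
from scipy.optimize import linprog

c = [2.0, 3.5]                      # c_i: cost per unit of food 1, food 2
A = np.array([[4.0, 10.0],          # rows = nutrients, cols = foods (a_ij)
              [12.0, 5.0]])
m = np.array([20.0, 25.0])          # m_j: minimum daily intake per nutrient
M = np.array([80.0, 90.0])          # M_j: maximum daily intake per nutrient

# linprog wants A_ub @ x <= b_ub, so minimum constraints become -A @ x <= -m
res = linprog(c,
              A_ub=np.vstack([-A, A]),
              b_ub=np.concatenate([-m, M]),
              bounds=[(0, None)] * 2)
print(res.status, np.round(res.x, 3), round(res.fun, 3))
```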

Example 2 - Call center schedulingΒΆ

Problem: Meet forecasted demand $d_i$ for each day of the week $i$

  • Workers work 5 days in a row, then 2 days off
  • Minimize worker-days used

Wrong approach (ignores the 5-consecutive-day schedule, so the chosen $x_i$ can't be staffed by real workers):

  1. Variables
    • $x_i$: number of ppl working on day $i$
  2. Constraints
    • Meet demand: $x_i \geq d_i$ for all days $i$
    • Integer workers: $x_i$ is integer for all days $i$ (can't hire fraction of a worker)
  3. Objective function
    • $\text{Minimize } x_\text{Sunday} + x_\text{Monday} + \ldots + x_\text{Saturday}$

Right approach: Accounts for consecutive worker days. This way, optimization already accounts for union rules.

  1. Variables
    • $x_i$: number of ppl who start working on day $i$
  2. Constraints
    • Meet demand: $\Sigma_\text{j working on day i} x_j \geq d_i$
      • Ex: $x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tue} \geq d_\text{Tue}$
    • Integer workers: $x_i$ is integer for all days $i$ (can't hire a fraction of a worker)
  3. Objective function
    • $\text{Minimize } 5(x_\text{Sun} + x_\text{Mon} + \ldots + x_\text{Sat})$
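The right approach can be sketched as an integer program with scipy (needs scipy >= 1.9 for the `integrality` argument of `linprog`; the daily demands are made-up numbers):

```python
# Call-center scheduling: x_j = number of people who START on day j,
# each working 5 consecutive days. Minimize total worker-days.
import numpy as np
from scipy.optimize import linprog

d = np.array([14, 20, 22, 18, 21, 23, 15])   # hypothetical demand Sun..Sat

# cover[i, j] = 1 if someone starting on day j is working on day i
n = 7
cover = np.zeros((n, n))
for j in range(n):
    for k in range(5):                       # 5 consecutive working days
        cover[(j + k) % n, j] = 1

c = 5 * np.ones(n)                           # each start = 5 worker-days
res = linprog(c,
              A_ub=-cover, b_ub=-d,          # cover @ x >= d (meet demand)
              bounds=[(0, None)] * n,
              integrality=np.ones(n))        # x_j must be integer

x = np.round(res.x).astype(int)
print(x, "total worker-days:", int(round(res.fun)))
```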

Modeling Binary VariablesΒΆ

Idea: Binary variables allow more complex models.

Integer variables in optimizationΒΆ

  • Fixed charges in objective function
  • Constraints to choose among options
  • Constraints requiring same/opposite decisions
  • If-then constraints

Linear Regression vs OptimizationΒΆ

  • LR just needs the data.
  • Optimization requires user to design a model (variables, constraints, obj function).

Example: Stock Market InvestmentΒΆ

Slide 1:

  • Add fixed charges for each transaction: subtract the fixed-charge term $\Sigma_i t_i y_i$ from the existing objective function, where $y_i$ is a binary variable indicating whether stock $i$ is traded.
  • Add a linking constraint to link $x_i$ and $y_i$.

img

Slide 2:

  • Add minimum amount for each stock: $x_i \geq m_i y_i$ for all stocks $i$.
  • Add various investment strategies for specific stocks. These are all reflected as different constraints.

img

Slide 3:

  • If-then constraint: If investing in energy stock, invest in at least 5.

img img
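A sketch of the fixed-charge plus linking-constraint pattern from the slides, as a small mixed-integer program (scipy >= 1.9; the returns, fees, and budget are made-up numbers):

```python
# Invest x_i dollars in stock i only if the fixed fee t_i is paid (y_i = 1).
# Maximize r@x - t@y  <=>  minimize -(r@x) + t@y.
import numpy as np
from scipy.optimize import linprog

r = np.array([0.08, 0.06, 0.05])     # expected return per dollar (hypothetical)
t = np.array([50.0, 50.0, 50.0])     # fixed transaction charge per stock
budget, big_m = 1000.0, 1000.0       # big_m bounds any single investment

# Variables: [x_0, x_1, x_2, y_0, y_1, y_2]
c = np.concatenate([-r, t])

# Linking constraint: x_i - big_m * y_i <= 0 (no investing unless fee is paid)
link = np.hstack([np.eye(3), -big_m * np.eye(3)])
bud = np.array([[1, 1, 1, 0, 0, 0]])          # total investment <= budget

res = linprog(c,
              A_ub=np.vstack([link, bud]),
              b_ub=[0, 0, 0, budget],
              bounds=[(0, None)] * 3 + [(0, 1)] * 3,
              integrality=[0, 0, 0, 1, 1, 1])  # x continuous, y binary
print(np.round(res.x, 1), "objective:", round(-res.fun, 1))
```

The linking constraint is what ties the continuous decision to the binary one; a minimum-amount rule would be the mirror image, $x_i \geq m_i y_i$.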

Discovering the Real QuestionsΒΆ

  • Oftentimes, we don't get clear paragraph that contains variables, constraints and objective function.
  • Discovery is hard:
    • Too many constraints to remember or say all at once.
    • Easy to forget obscure constraints.
    • Easy to omit "obvious" (to experts) issues.
    • Example: 5 day work week means 5 consecutive days, no partial shifts.
  • Repeat until done
    • Get information
    • Build model
    • Check with stakeholders (do solutions make sense)
    • If yes, done; if no, repeat.

Violating the Constraints:

  • You might find out that the company is violating constraints.
  • Project might shift to complying with laws.

QuizΒΆ

Let $y_i$ and $y_j$ both be binary variables, and let $x_i$ be a continuous variable. Which expression is equivalent to the constraint "If $y_i=1$, then $y_j$ must also be 1"?

A: $y_j \geq y_i$
